Skip to content

Janus pro mix dataset#13

Open
Mark-ZhouWX wants to merge 3 commits into
SamitHuang:janusfrom
Mark-ZhouWX:janus
Open

Janus pro mix dataset#13
Mark-ZhouWX wants to merge 3 commits into
SamitHuang:janusfrom
Mark-ZhouWX:janus

Conversation

@Mark-ZhouWX

Copy link
Copy Markdown
Contributor

What does this PR do?

Add janus pro mix dataset

@Mark-ZhouWX Mark-ZhouWX requested a review from SamitHuang as a code owner March 17, 2025 01:41
@SamitHuang

Copy link
Copy Markdown
Owner

/gemini

@gemini-code-assist

Copy link
Copy Markdown
Contributor

It looks like you might have intended to run a command. The /gemini command requires a subcommand. For example, to request a code review, you can use /gemini review. To get a summary of the pull request, you can use /gemini summary.

If you need a list of all available commands, please use /gemini help.

@SamitHuang

Copy link
Copy Markdown
Owner

/gemini review

@SamitHuang

Copy link
Copy Markdown
Owner

/gemini summary

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a MixDataset to combine text, VQA, and text-to-image datasets for mixed-task training. The implementation has a few areas for improvement in terms of correctness, efficiency, and adherence to best practices. My review includes suggestions to fix the dataset order, use more efficient numpy operations, and clean up minor code style issues.

from janus.train.vqa_dataset import VqaDataset


class MixDataset:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

There's an extra space between class and the class name. According to PEP 8, there should be a single space.

Suggested change
class MixDataset:
class MixDataset:

Comment on lines +18 to +24
def cumsum(sequence):
r, s = [], 0
for e in sequence:
l = len(e)
r.append(l + s)
s += l
return r

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This manual implementation of cumsum can be simplified and made more efficient by using numpy.cumsum. Since numpy is already a dependency, this would be more idiomatic and concise. bisect works fine with NumPy arrays.

Suggested change
def cumsum(sequence):
r, s = [], 0
for e in sequence:
l = len(e)
r.append(l + s)
s += l
return r
def cumsum(sequence):
return np.cumsum([len(e) for e in sequence])

self.default_image_shape = default_image_shape
self.max_token_length = max_token_length
self.datasets = datasets
self.num_dataset = len(datasets)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The instance variable self.num_dataset is assigned but never used. It should be removed to avoid confusion and keep the code clean.

# add image and image_seq_mask item to pure text for batching
if dataset_idx == 0:
image = np.zeros(self.default_image_shape, np.float32)
image_seq_mask = np.zeros((self.max_token_length), dtype=np.bool)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The np.bool alias is deprecated since NumPy 1.20 and will be removed in a future version. You should use the standard Python bool type instead for forward compatibility.

Suggested change
image_seq_mask = np.zeros((self.max_token_length), dtype=np.bool)
image_seq_mask = np.zeros((self.max_token_length), dtype=bool)

Comment on lines +103 to +107
sample_weights = []
assert len(sample_ratios) == len(datasets)
for i in range(len(sample_ratios)):
weight = sample_ratios[i] * len(mix_dataset) / len(datasets[i])
sample_weights += [weight] * len(datasets[i])

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The loop for creating sample_weights can be written more concisely and efficiently using NumPy operations. Using np.repeat is more idiomatic for this kind of task and can be faster for large datasets.

Suggested change
sample_weights = []
assert len(sample_ratios) == len(datasets)
for i in range(len(sample_ratios)):
weight = sample_ratios[i] * len(mix_dataset) / len(datasets[i])
sample_weights += [weight] * len(datasets[i])
assert len(sample_ratios) == len(datasets)
weights_per_dataset = np.array([
r * len(mix_dataset) / len(d) for r, d in zip(sample_ratios, datasets)
])
counts = [len(d) for d in datasets]
sample_weights = np.repeat(weights_per_dataset, counts).tolist()

@gemini-code-assist

Copy link
Copy Markdown
Contributor

Summary of Changes

This pull request introduces a foundational component for multi-modal training within the Janus Pro project. It provides a robust mechanism to combine and sample from disparate datasets—pure text, VQA, and text-to-image—into a single, coherent data stream for MindSpore-based models. This enhancement is crucial for training models that require exposure to a wide variety of data types to improve their generalization and performance across different tasks.

Highlights

  • New Dataset Implementation: I've added a new MixDataset class that intelligently combines multiple types of datasets, specifically pure text, Visual Question Answering (VQA), and text-to-image data. This allows for a unified training pipeline across diverse data modalities.
  • Flexible Data Loading Utility: A create_mix_dataloader function has been introduced to simplify the configuration and loading of this mixed dataset. It supports various parameters like dataset paths, token lengths, image sizes, and crucially, allows for weighted sampling to control the proportion of samples drawn from each underlying dataset.
  • MindSpore Integration: The new data pipeline leverages MindSpore's dataset API, utilizing GeneratorDataset, DistributedSampler, and WeightedRandomSampler to ensure efficient and scalable data loading for training.
Changelog
  • examples/janus/janus/train/mix_dataset.py
    • Added MixDataset class: This class acts as a wrapper around multiple datasets, providing a unified interface to iterate over them. It includes a cumsum static method to manage cumulative sizes for indexing and ensures consistent output by adding placeholder image/mask data for text-only samples.
    • Implemented create_mix_dataloader function: This function orchestrates the creation of individual TextDataset, TextImageDataset, and VqaDataset instances, combines them into a MixDataset, and then wraps it with MindSpore's GeneratorDataset and appropriate samplers (DistributedSampler, WeightedRandomSampler) to form the final dataloader.
    • Included an example usage block (if __name__ == "__main__":): This demonstrates how to initialize the VLChatProcessor and create a mixed dataloader, providing a quick way to test the new functionality.
Activity
  • The author, Mark-ZhouWX, opened this pull request to add the Janus Pro mix dataset.
  • SamitHuang initiated a /gemini command, followed by requests for a /gemini review and /gemini summary.
  • The gemini-code-assist[bot] provided several review comments, including a PEP 8 style suggestion for class definition spacing, a recommendation to use numpy.cumsum for efficiency, identification of an unused instance variable (self.num_dataset), a warning about the deprecated np.bool type, and a suggestion for a more idiomatic NumPy approach to calculate sample_weights using np.repeat.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants